Meta Random Forests
Authors
Praveen Boinee, Alessandro De Angelis, Gian Luca Foresti
Abstract
Leo Breiman's Random Forests (RF) is a recent development in tree-based classifiers and has quickly proven to be one of the most important algorithms in the machine learning literature. It has shown robust and improved classification results on standard data sets. Ensemble learning algorithms such as AdaBoost and Bagging have been an active area of research and have shown improved classification results on several benchmark data sets, mainly with decision trees as their base classifiers. In this paper we apply these meta-learning techniques to random forests. We study the behaviour of ensembles of random forests on standard data sets from the UCI repository, compare the original random forest algorithm with its ensemble counterparts, and discuss the results.

Keywords— Random Forests (RF), ensembles, UCI.

Manuscript received September 30, 2005. Praveen Boinee is a PhD student in Computer Science at Udine University, Udine, 33100, Italy (phone: 0039-0432-558231; e-mail: [email protected]). Alessandro De Angelis is Professor of Experimental and Computational Physics at Udine University, Udine, 33100, Italy (e-mail: [email protected]). Gian Luca Foresti is Professor of Computer Science at Udine University, Udine, Italy (e-mail: [email protected]).

I. PROBLEM DOMAIN

Random Forests (RF) [1] are one of the most successful tree-based classifiers. They have proven to be fast, robust to noise, and offer possibilities for explanation and visualization of their output. In the random forest method, a large number of classification trees are grown and combined. Statistically speaking, two elements serve to obtain a random forest: resampling and random split selection. Resampling is done by sampling multiple times with replacement from the original training data set. Thus, in the resulting samples, a certain event may appear several times and other events not at all. About 2/3 of the data in the training sample are taken for each bootstrap sample, and the remaining one third of the cases are left out of the sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance. Random forests are designed to give the user a good deal of information about the data besides an accurate prediction; much of this information comes from using the oob cases of the training set, which have been left out of the bootstrapped training set. Random split selection is used in the growing process of each tree. The method is computationally effective and offers good prediction performance. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing. It generates an internal unbiased estimate of the generalization error as the forest building progresses and thus does not over-fit. These capabilities of RF can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection. Several authors have noted that constructing ensembles of base learners can significantly improve the performance of learning. Bagging and boosting are the most popular examples of this methodology. The success of ensemble methods is usually explained with the margin and correlation of the base classifiers [13]. To have a good ensemble, one needs base classifiers which are diverse (in the sense that they predict differently), yet accurate.
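As a concrete illustration of the resampling and oob mechanism just described, the short sketch below fits a random forest with out-of-bag scoring enabled. It is only a minimal example, not the authors' setup: it assumes scikit-learn, uses the Iris data purely as a placeholder, and all parameter values are arbitrary.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Placeholder data set; any labelled data set would do for the illustration.
X, y = load_iris(return_X_y=True)

# Each tree is grown on a bootstrap sample (about 2/3 of the cases); the
# left-out cases form that tree's oob set, which yields a running unbiased
# estimate of the classification error and of variable importance.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X, y)

print("oob accuracy estimate:", rf.oob_score_)
print("variable importances:", rf.feature_importances_)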
The ensemble mechanism that operates on top of the base learners then ensures highly accurate predictions. Here we use random forests themselves as the base classifiers for building ensembles and test the performance of the resulting models. The ensembles are applied to standard UCI data sets and compared with the original random forest algorithm. The paper is organized as follows. In Section II we introduce decision trees, the basis for constructing random forests. Section III introduces the random forest algorithm itself. Section IV discusses ensemble learning and the construction of bagged and boosted random forests. The experiments with the UCI data sets are described in Section V, and the results are discussed in Section VI.

II. DECISION TREES – A BASE FOR RANDOM FORESTS

The decision-tree representation is the most widely used logic method for efficiently producing classifiers from data. A large number of decision-tree induction algorithms are described, primarily in the machine-learning and applied-statistics literature. The decision tree algorithm is well known for its robustness and learning efficiency, with a learning time complexity of O(n log2 n). The output of the algorithm is a decision tree, which can easily be represented as a set of symbolic rules (IF...THEN). The symbolic rules can be directly interpreted and compared with the existing domain knowledge, providing useful information for domain experts. A typical decision-tree learning system adopts a top-down strategy that searches for a solution in a part of the search space. It guarantees that a simple, but not necessarily the simplest, tree will be found. A decision tree consists of nodes where attributes are tested. The outgoing branches of a node correspond to all the possible outcomes of the test at the node. A simple decision tree for classification of samples with two input attributes X and Y is given in Fig. 1.

Fig. 1 A simple decision tree with tests on attributes X and Y

All samples with feature values X>1 and Y=B belong to Class2, while the samples with values X<1 belong to Class1, whatever the value of feature Y. The samples at a non-leaf node in the tree structure are thus partitioned along the branches, and each child node gets its corresponding subset of samples. Decision trees that use univariate splits have a simple representational form, making it relatively easy for the user to understand the inferred model; at the same time, they represent a restriction on the expressiveness of the model. In general, any restriction on a particular tree representation can significantly restrict the functional form, and thus the approximation power, of the model. A well-known tree-growing algorithm for generating decision trees based on univariate splits is Quinlan's ID3, with an extended version called C4.5 [6].
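Read as symbolic rules, the tree of Fig. 1 can be encoded directly as IF...THEN tests. The fragment below is only an illustration of that reading, not code from the paper; the branch for X>1 with Y different from B is not specified in the excerpt above and is assigned to Class1 here purely as an assumption.

def classify(x, y_attr):
    # Rules read from the decision tree of Fig. 1.
    if x < 1:
        return "Class1"      # IF X < 1 THEN Class1, whatever the value of Y
    if y_attr == "B":
        return "Class2"      # IF X > 1 AND Y = B THEN Class2
    return "Class1"          # remaining branch: assumed Class1 (not given above)

print(classify(0.5, "A"))    # -> Class1
print(classify(2.0, "B"))    # -> Class2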
Greedy search methods, which involve growing and pruning decision-tree structures, are typically employed in these algorithms to explore the exponential space of possible models and to remove unnecessary preconditions and duplication. C4.5 applies a divide-and-conquer strategy to construct the tree. The sets of instances are accompanied by a set of properties. A decision tree is a tree where each node is a test on the values of an attribute, and the leaves represent the class of an instance that satisfies the tests. The tree returns a 'yes' or 'no' decision when the sets of instances are tested on it. Rules can be derived from the tree by following a path from the root to a leaf and using the nodes along the path as preconditions for the rule, to predict the class at the leaf. For developing random forests, we use trees that randomly choose a subset of attributes at each node.

III. RANDOM FORESTS

A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...}, where the {Θk} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. The forest chooses the classification having the most votes over all the trees in the forest. Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that, at each node, m variables are selected at random out of the M and the best split on these m is used to split the node.
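The core idea of the paper, namely using random forests themselves as the base classifiers of bagged and boosted ensembles, can be prototyped along the following lines. This is a sketch only: scikit-learn is used as a stand-in for the authors' implementation, the breast-cancer data set stands in for the UCI sets used in the paper, and all parameter values (as well as the estimator keyword, available in scikit-learn 1.2 and later) are assumptions rather than the paper's actual configuration.

from sklearn.datasets import load_breast_cancer   # UCI-derived data set, used as a stand-in
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# A small random forest serves as the base classifier of the meta ensembles.
base_rf = RandomForestClassifier(n_estimators=10, random_state=0)

models = {
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Bagged random forests: each member forest sees a bootstrap sample.
    "Bagged RF": BaggingClassifier(estimator=base_rf, n_estimators=10,
                                   random_state=0),
    # Boosted random forests: member forests are reweighted by AdaBoost.
    "Boosted RF": AdaBoostClassifier(estimator=base_rf, n_estimators=10,
                                     random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold CV accuracy = {acc:.3f}")

Splitting the total number of trees between the base forests and the outer ensemble is the main design choice here; it trades off the diversity and accuracy of the base classifiers discussed in Section I.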
Similar resources
Meta-Classifiers Easily Improve Commercial Sentiment Detection Tools
In this paper, we analyze the quality of several commercial tools for sentiment detection. All tools are tested on nearly 30,000 short texts from various sources, such as tweets, news, reviews, etc. The best commercial tools have an average accuracy of 60%. We then apply machine learning techniques (Random Forests) to combine all tools, and show that this results in a meta-classifier that improves ...
On a Few Recent Developments in Meta-Learning for Algorithm Ranking and Selection
This talk has two main parts. The first part will focus on the use of pair-wise meta-rules for algorithm ranking and selection. Such rules can provide interesting insights on their own, but they are also very valuable features for more sophisticated schemes like Random Forests. A hierarchical variant is able to address complexity issues when the number of algorithms to compare is substantial. T...
Swiss-Chocolate: Combining Flipout Regularization and Random Forests with Artificially Built Subsystems to Boost Text-Classification for Sentiment
We describe a classifier for predicting message-level sentiment of English microblog messages from Twitter. This paper describes our submission to the SemEval2015 competition (Task 10). Our approach is to combine several variants of our previous year’s SVM system into one meta-classifier, which was then trained using a random forest. The main idea is that the meta-classifier allows the combinat...
Random forests algorithm in podiform chromite prospectivity mapping in Dolatabad area, SE Iran
The Dolatabad area located in SE Iran is a well-endowed terrain owning several chromite mineralized zones. These chromite ore bodies are all hosted in a colored mélange complex zone comprising harzburgite, dunite, and pyroxenite. These deposits are irregular in shape, and are distributed as small lenses along colored mélange zones. The area has a great potential for discovering further chromite...
Compromising Multiple Objectives in Production Scheduling: A Data Mining Approach
In multi-objective scheduling problems, the objectives are usually in conflict. To obtain a satisfactory compromise and resolve the issue of NP-hardness, most existing works have suggested employing meta-heuristic methods, such as genetic algorithms. In this research, we propose a novel data-driven approach for generating a single solution that compromises multiple rules pursuing different obje...
CoCoST: A Computational Cost Sensitive Classifier
Computational cost of classification is as important as accuracy in on-line classification systems. The computational cost is usually dominated by the cost of computing implicit features of the raw input data. Very few efforts have been made to design classifiers which perform effectively with limited computational power; instead, feature selection is usually employed as a pre-processing step t...
Journal title:
Volume / Issue
Pages -
Publication date 2005